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Abstract. Graphical model learning and inference are often performed using 
Bayesian techniques. In particular, learning is usually performed in two separate 
steps. First, the graph structure is learned from the data; then the parameters of 
the model are estimated conditional on that graph structure. While the probability 
distributions involved in this second step have been studied in depth, the ones used 
in the first step have not been explored in as much detail. 

In this paper, we will study the prior and posterior distributions defined over 
the space of the graph structures for the purpose of learning the structure of a 
graphical model. In particular, we will provide a characterisation of the behaviour 
of those distributions as a function of the possible edges of the graph. We will 
then use the properties resulting from this characterisation to define measures of 
structural variability for both Bayesian and Markov networks, and we will point 
out some of their possible applications. 

Keywords: Markov Networks; Bayesian Networks; Random Graphs; Structure 
Learning; Multivariate Discrete Distributions. 

Graphical models (Pearl 1988; Lauritzen 1996) stand out among other classes of 
statistical models because of their use of graph structures in modelling and performing 
inference on multivariate, high-dimensional data. The close relationship between their 
probabilistic properties and the topology of the underlying graphs represents one of 
their key features, as it allows an intuitive understanding of otherwise complex models. 

In a Bayesian setting, this duality leads naturally to split model estimation (which 
is usually called learning) in two separate steps (Cowell et al. 2007). In the first step, 
called structure learning, the graph structure Q of the model is estimated from the 
data. The presence (absence) of a particular edge between two nodes in Q implies the 
conditional (in)dependence of the variables corresponding to such nodes. In the second 
step, called 'parameter learning, the parameters of the distribution assumed for the 
data are estimated conditional to the graph structure obtained in the first step. If we 
denote a graphical model with A4, so that M. = (0,0), then we can write graphical 
model estimation from a data set V as 



Furthermore, following Heckerman et al. (1995), we can rewrite structure learning as 



¥{M\V) = Y{g\V)Y{Q\g,V). 



P@|2?)ocP(0)P(X>|S). 



(1) 
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The prior distribution P(G) and the corresponding posterior distribution P{Q | V) are 
defined over the space of the possible graph structures, say G. Since the dimension of G 
grows super-exponentially with the number of nodes in the graph (Harary and Palmer 
1973), it is common practice to choose 

P(0) = ~ for every Q e G (2) 

as a non-informative prior, and then to search for the graph structure Q that maximises 
P(G | T>). Unlike such a maximum a posteriori (MAP) approach, a full Bayesian analysis 
is computationally unfeasible in most real-world settings (Friedman et al. 1999a; Kollcr 
and Friedman 2009). Therefore, inference on most aspects of P(G) and P(Q \T>) is 
severely limited by the nature of the graph space. 

In this paper, we approach the analysis of those probability distributions from a dif- 
ferent angle. We start from the consideration that, in a graphical model, the presence 
of particular edges and their layout are the most interesting features of the graph struc- 
ture. Therefore, investigating P{Q) and P{Q \ V) through the probability distribution 
they induce over the set £ of their possible edges (identified by the set of unordered 
pairs of nodes in Q) provides a better basis from which to develop Bayesian inference on 
Q. This can be achieved by modelling £ as a multivariate discrete distribution encoding 
the joint state of the edges. Then, as far as inference on Q is concerned, we may rewrite 
Equation 1 as 

P(0(£)|X>)ocP(0(£))P(X>|g(£)). 

As a side effect, this shift in focus reduces the effective dimension of the sample space 
under consideration from super-exponential (the dimension of G) to polynomial (the 
dimension £) in the number of nodes. The dimension of the parameter space for many 
inferential tasks, such as the variability measures studied in this paper, is likewise re- 
duced. 

The content of the paper is organised as follows. Basic definitions and notations are 
introduced in Section 1. The multivariate distributions used to model £ are described 
in Section 2. Some properties of the prior and posterior distributions on the graph 
space, P(Q(£)) and P(Q(£) \ T>), are derived in Section 3. We will focus mainly on those 
properties related with the first and second order moments of the distribution of £ , and 
we will use them to characterise several measures of structural variability in Section 
4. These measures may be useful for several inferential tasks for both Bayesian and 
Markov networks; some will be sketched in Section 4. Conclusions are summarised in 
Section 5, and proofs for the theorems in Sections 2 to 4 are reported in Appendix 1. 
Appendix 2 lists the exact values for some quantities of interest for P(C?(£)), computed 
for several graph sizes. 

1 Definitions and notations 

Graphical models (Lauritzen 1996; Pearl 1988) are a class of statistical models which 
combine the rigour of a probabilistic approach with the intuitive representation of re- 
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lationships given by graphs. They are composed by a set X = {Xi, . . . , X n } of random 
variables describing the data T> and a graph Q = (V, E) in which each vertex or node 
v e V is associated with one of the random variables in X. Nodes and the corresponding 
variables are usually referred to interchangeably. The edges e e E are used to express 
the dependence relationships among the variables in X. Different classes of graphs ex- 
press these relationships with different semantics, having in common the principle that 
graphical separation of two vertices implies the conditional independence of the corre- 
sponding random variables (Pearl 1988). The two examples most commonly found in 
literature are Markov networks (Whittaker 1990; Edwards 2000), which use undirected 
graphs (UGs, see Diestel 2005), and Bayesian networks (Neapolitan 2003; Korb and 
Nicholson 2010), which use directed acyclic graphs (DAGs, see Bang- Jensen and Gutin 
2009). In the context of Bayesian networks, edges are often called arcs and denoted 
with a e A; we will adopt this notation as well. 

The structure of Q (that is, the pattern of the nodes and the edges) determines 
the probabilistic properties of a graphical model. The most important, and the most 
used, is the factorisation of the global distribution (the joint distribution of X) into a 
set of lower-dimensional local distributions. In Markov networks, local distributions are 
associated with cliques (maximal subsets of nodes in which each element is adjacent 
to all the others); in Bayesian networks, each local distribution is associated with one 
node conditional on its parents (nodes linked by an incoming arc) . In Markov networks 
the factorisation is unique; different graph structures correspond to different probability 
distributions. This is not so in Bayesian networks, where DAGs can be grouped into 
equivalence classes which are statistically indistinguishable. Each such class is uniquely 
identified by the underlying UG (i.e. in which arc directions are disregarded, also 
known as skeleton) and by the set of v-structures (i.e. converging connections of the 
form v.i — > Vj <— Vk, i ¥= j ¥= fc, in which Vi and are not connected by an arc) common 
to all elements of the class. 

As for the global and the local distributions, there are many possible choices depend- 
ing on the nature of the data and the aims of the analysis. However, literature have 
focused mostly on two cases: the discrete case (Whittaker 1990; Heckerman et al. 1995), 
in which both the global and the local distributions are multinomial random variables, 
and the continuous case (Whittaker 1990; Gcigcr and Heckerman 1994), in which the 
global distribution is multivariate normal and the local distributions are univariate (in 
Bayesian networks) or multivariate (in Markov networks) normal random variables. In 
the former, the parameters of interest O are the conditional probabilities associated with 
each variable, usually represented as conditional probability tables. In the latter, the 
parameters of interest Q are the partial correlation coefficients between each variable 
and its neighbours in Q. Conjugate distributions (Dirichlet and Wishart, respectively) 
are then used for learning and inference in a Bayesian setting. 
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2 Multivariate discrete distributions 

The choice of an appropriate probability distribution for the set £ of the possible edges 
is crucial to make the derivation and the interpretation of the properties of £ and Q{£ ) 
easier. We will first note that a graph is uniquely identified by its edge set E (or by 
its arc set A for a DAG), and that each edge or arc is uniquely identified by the 
nodes Wj and Vj, i ¥= j it is incident on. Therefore, if we model £ with a random variable 
we have that any edge set E (or arc set A) is just an element of its sample space; and 
since there is a one-to-one correspondence between graphs and edge sets, probabilistic 
properties and inferential results derived for traditional graph-centric approaches can 
easily be adapted to this new edge-centric approach and vice versa. In addition, if we 
denote £ = {(Vi,Vj),i ^ j}, we can clearly see that \£\ = C(|V| 2 ). On the other hand, 
|G| = C(2l v l' ) for UGs and even larger for DAGs (Robinson 1973; Harary and Palmer 
1973) and their equivalence classes (Gillispie and Perlman 2002). 

We will also note that an edge or an arc has only few possible states: 

• an edge can be either present (e^ 6 E) or missing from an UG (eij $ E); 

• in a DAG, an arc can be present in one of its two possible directions (S7j e A or 
Oij £ A) or missing from the graph (Sy ^ A and ay ^ A). 

This leads naturally to the choice of a Bernoulli random variable for the former, 

!1 e E with probability p^ 
$ E with probability 1 — p^ 

and to the choice of a Trinomial random variable for the latter, 



Aij 



— 1 Uij e A with probability pij 

Oij $ A with probability pij , (4) 

1 Oij e A with probability p^ 



where Oij is the arc Vi — > Vj and is the arc Vj — > x>i. Therefore, a graph structure 
can be modelled through its edge or arc set as follows: 



• UGs, such as Markov networks or the skeleton and the moral graph of Bayesian 
networks (Pearl 1988), can be modelled by a multivariate Bernoulli random vari- 
able; 

• directed graphs, such as the DAGs used in Bayesian networks, can be modelled 

by a multivariate Trinomial random variable. 



In addition to being the natural choice for the respective classes of graphs, these 
distributions integrate smoothly with and extend other approaches present in litera- 
ture. For example, the probabilities associated with each edge or arc correspond to the 
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confidence coefficients from Friedman ct al. (1999a) and the arc strengths from Imoto 
et al. (2002). In a frequentist setting, they have been estimated using bootstrap resam- 
pling (Efron and Tibshirani 1993); in a Bayesian setting, Markov chain Monte Carlo 
(MCMC) approaches (Friedman and Koller 2003; Mclangon and Fabricc 2004) have 
been used instead. 



2.1 Multivariate Bernoulli 

Let B\ , . . . , Bk , k e N be Bernoulli random variables with marginal probabilities of 
success pi,...,pk, that is Bi ~ Ber(p{), i = l,...,k. Then the distribution of the 
random vector B = [Bi, . . . , Bk] T over the joint probability space of B\, . . . , Bk is a 
multivariate Bernoulli random variable (Krummenauer 1998), denoted as Berk(p). Its 
probability function is uniquely identified by the parameter collection 

p = {pi : I<= {l,...,fc}, 0}, 

which represents the dependence structure among the Bi in terms of simultaneous suc- 
cesses for every non-empty subset / of elements of B. Other characterisations and 
fundamental properties of the multivariate Bernoulli distribution can be found in John- 
son et al. (1997). Kocherlakota and Kocherlakota (1992) focus on the bivariate models 
specific to _Ber 2 (p). Additional characterisations and results specific to particular ap- 
plications can be found in George and McCulloch (1997, variable selection), Farrell and 
Rogers-Stewart (2008, longitudinal studies), Rubinstein (1999, combinatorial optimisa- 
tion) and Agresti and Klingenberg (2005, clinical trials), among others. 

From literature we know that the expectation and the covariance matrix of B arc 
immediate extensions of the corresponding univariate Bernoulli ones; 

E(B) = [pi, . . . p k ] T and COV(B) = [cry] = p {j - pipj. 

In particular, the covariance matrix E = [c^] has some interesting numerical properties. 
From basic probability theory, we know its diagonal elements an are bounded in the 
interval [0, |]; the maximum is attained for Pi = \ 1 and the minimum for both pi = 
and Pi = 1. For the Cauchy-Schwarz theorem then \a^\ e [0, jl. As a result, we can 
derive similar bounds for the eigenvalues Ai,...,Afe of E, as shown in the following 
theorem. 

Lemma 2.1. Let B ~ Berk(p), and let E be its covariance matrix. Let Aj, i = 1, . . . , k 

be the eigenvalues o/E. Then 

x^i k k 
;$ 2j ^ J and 0sgA 4 <-. 

i— 1 

Proof. See Appendix 1. □ 
These bounds define a closed convex set in K fc , described by the family 

£=L k - 1 (c):ce [o, ^ 
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where A fc_1 (c) is the non-standard k — 1 simplex 



A k -\c)= U\ 1 ,...,X k )eR k :^X i = c,X i >0\. (5) 
2.2 Multivariate Trinomial 

Construction and properties of the multivariate Trinomial random variable are similar 
to the ones illustrated in the previous section for the multivariate Bernoulli. For this 
reason, and because it is a particular case of the multivariate multinomial distribution, 
the multivariate Trinomial distribution is rarely the focus of research efforts in literature. 
Some of its fundamental properties are covered either in Johnson et al. (1997) or in 
monographs on contingency tables analysis such as Bishop et al. (2007). 

Let Ti,...,Tfc, k e N be Trinomial random variables assuming values {—1,0,1} 
and denoted as T; - Tri (Pi(_i),Pi(o),Pi(i)) with + p i(0) + p i(1) = 1. Then the 

distribution of the random vector T = [Ti, . . . ,lfc] T over the joint probability space 
of T\, . . . , Tfc is a multivariate Trinomial random variable, denoted as Trifc(p). The 
parameter collection p which uniquely identifies the distribution is 

P - ^p J(T) :/s{l,...,fc}, Te X {-1,0,1},/*.- 

»=i 

and the reduced parameter collection we will need to study its first and second order 
moments is 

P = {fti(T) -hi = !,-••, k,Te {-1,0, l} 2 }. 
From the definition, we can easily derive the expected value and the variance of Tj, 
E(Ti) = Pi(i) -Pt(-i) 



VAR(Ti) = p <(1) + p <( _!) - [p i(1) - p l( _ 1} ] 



2 



and the covariance between two variables Tj and Tj, 

COV(r i ,T j ) = [pij(i,i) - Pi(i)Pj(i) ] + [Py(-i,-i) - Pi(_i)Pj(_i) ] - 

- [P<j(-i,i) -ft(-i)Pj(i)] - [Ptf(l-l) -Pi(i)Pj(-i)]- 

Again, the diagonal elements of the covariance matrix £ are bounded. This can be 
proved either by solving the constrained maximisation problem 

max VAR(Ti) s.t. Pi(i) > 0, Pi(-i) > 0, Pi(i) + Pi(-i) ^ 1 

or as an application of the following theorem by Moors and Muilwijk (1971). 
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Theorem 2.1. If a discrete random variable X can take values only in the segment 
\x\,Xfi\ of the real axis, the maximum standard deviation of X equals \{x n —x-y). The 
maximum is reached if X takes the values x\ and x n with probabilities | each. 

Proof. See Moors and Muilwijk (1971). □ 

In both cases we obtain that the maximum variance is achieved for Pin) = Pi(-i) — \ 
and is equal to 1, so an e [0, 1] and \o~ij\ e [0, 1]. Furthermore, we can also prove that 
the eigenvalues of E are bounded using the same arguments as in Lemma 2.1. 

Lemma 2.2. Let T ~ Trifc(p), and let E be its covariance matrix. Let \i, i = 1, . . . , k 

be the eigenvalues o/E. Then 

k 

«S Yj Xi k and sg Aj ^ fc. 

i=l 

Proof. See the proof of Lemma 2.1 in Appendix 1. □ 

These bounds define again a closed convex set in K fe , described by the family 

C={A k -\c):ce[0,k]}, 

where A fc_1 (c) is the non-standard k — 1 simplex from Equation 5. 

Another useful result, which we will use in Section 3.2 to link inference on UGs and 
DAGs, is introduced below. 

Theorem 2.2. Let T - Tri k (p); then |T| = B ~ Ber k (p*) and 
\Ti\=Bi~ Ber{p*). 

Proof. See Appendix 1. □ 

It follows that the variance of each Tj can be decomposed in two parts: 

VAR(Tj) = VAR(Bj) + ^p i{X )P i{ -xy (6) 

The first is a function of the corresponding component |Tj| = Bi of the transformed 
random vector, while the second depends only on the probabilities associated with — 1 
and 1 (which correspond to Sy and aTj in Equation 4). 
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3 Properties of P(£(£)) and I V) 

The results derived in the previous section provide the foundation for characterising 
P(G(£)) and P(Q(£)\T>). To this end, it is useful to distinguish three cases corre- 
sponding to different configurations of the probability mass among the graph structures 
Q{£) e G: 

• minimum entropy: the probability mass is concentrated on a single graph struc- 
ture. This is the best possible configuration for P{Q{£ ) \ T>), because only one edge 
set E (or one arc set A) has a non-zero posterior probability. In other words, the 
data T> provide enough information to identify a single graph Q with posterior 
probability f; 

• intermediate entropy: several graph structures have non-zero probabilities. This is 
the case for informative priors P(Q(£)) and for the posteriors P(Q(£) | T>) resulting 
from real-world data sets; 

• maximum entropy: all graph structures in G have the same probability. This 
is the worst possible configuration for P(Q(£) \T>), because it corresponds to the 
non-informative prior from Equation 2. In other words, the data T> do not provide 
any information useful in identifying a high-posterior graph Q. 

Clearly, minimum and maximum entropy are limiting cases for P(Q(£) \ T>); the former 
is non-informative about Q{£), while the latter identifies a single graph in G. As we 
will show in Sections 3.1 (for UGs) and 3.2 (for DAGs), they provide useful reference 
points in determining which edges (or arcs) have significant posterior probabilities and 
in analysing the variability of the graph structure. 

3.1 Undirected graphs 

In the minimum entropy case, only one configuration of edges E has non-zero probabil- 
ity, which means that 

f I if eij e E 

Pii = < and £ = O. 

I otherwise 

The uniform distribution over G arising from the maximum entropy case has been 
studied extensively in random graph theory (Bollobas 2001); its two most relevant 
properties are that all edges are independent and have pij = \. As a result, £ = 
all edges display their maximum possible variability, which along with the fact that they 
are independent makes this distribution non-informative for £ as well as Q{£)- 

The intermediate entropy case displays a middle-ground behaviour between the min- 
imum and maximum entropy cases. The expected value and the covariance matrix of 
£ do not have a definite form beyond the bounds derived in Section 2.1. When consid- 
ering posteriors arising from real-world data, we have in practice that most edges in £ 



Marco Scutari 



represent conditional dependence relationships that are completely unsupported by the 
data. This behaviour has been explained by Pearl (2009) with the tendency of "good" 
graphical models to represent the causal relationships underlying the data, which are 
typically sparse. As a result, we have that E(ejj) = and VAR(e, J ) = for many ey, so 
S is almost surely singular unless such edges are excluded from the analysis. Edges that 
appear with p^ ~ | have about the same marginal probability and variance as in the 
maximum entropy case, so their marginal behaviour is very close to random noise. On 
the other hand, edges with probabilities near or 1 can be considered to have a good 
support (against or in favour, respectively). As pij approaches or 1, approaches 
its minimum entropy. 

The closeness of a multivariate Bernoulli distribution to the minimum and maximum 
entropy cases can be represented in an intuitive way by considering the eigenvalues 
A = [Ai, . . . , Afc] T of its covariance matrix E. Recall that the A can assume values in 
the convex set C defined in Equation 5, which corresponds to the region of the first 
orthant delimited by the non-standard simplex A fe_1 (4). In the minimum entropy case 
we have that £ = O, so Ai = . . . = A& = 0, and in the maximum entropy case £ = \lk, 
so Ai = . . . = \k = \\ both points lie on the boundary of C, the first in the origin and 
the second in the middle of A fc_1 (|). The distance between A and these two points 
provides an intuitive way of measuring the variability of £ and, indirectly, the entropy 
of the corresponding probability distributions P(Q(£) | T>) and ¥(Q(£ )). It is important 
to note, however, that different distributions over G may have identical first and second 
order moments when modelled through £. Such distributions will have the same A and 
will therefore map to the same point in C. 

A simple example comprising three different distributions over a set of two edges is 
illustrated below. 

Example 3.1. Consider three multivariate Bernoulli distributions B 1; B 2 , B 3 over 
two edges (denoted with e\ ~ E± and e 2 ~ E 2 for brevity) with covariance matrices 

0.1056 0.1456" 
0.1456 0.2016 

and eigenvalues 

0.3069 
0.0003 ' 

Their positions in C are shown in Figure 1. Bi is the closest to Q, |), the point 
corresponding to the maximum entropy case, while B 2 and B3 are farther from (z, j) 
than Bi due to the increasing correlation between e\ and e 2 (which are independent 
in the maximum entropy case). The correlation coefficients for Bi, B 2 and B3 are 
COR Bl (E ll E 2 ) = 0.1666, CORb 2 (E u E 2 ) = -0.2303, COR B3 (E 1 ,E 2 ) = 0.9978, and 
they account for the increasing difference between the eigenvalues of each covariance 
matrix. In fact, £3 is nearly singular because of the strong linear relationship between 
ex and e 2 , and it is therefore very close to one of the axes delimiting the first quadrant. 



Si 



0.24 
0.04 



0.04 
0.24 



0.1056 
-0.0336 



-0.0336 
0.2016 



Ai 



0.28 
0.20 



0.2121 
0.095 
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(0,|) 




Figure 1: The covariance matrices Si, S2 and S3 from Example 3.1 represented as 
functions of their eigenvalues in the convex set C. The points (0, 0) and (2,2) correspond 
to the minimum entropy and maximum entropy cases. 

If we denote with E 00 = {0}, E i = {e2], E w = {e\}, and En = {ei,e 2 } all possible 
edge sets and with poo, poi, pio and pn the associated probabilities, for Bi we have 

Poo = 0.2, poi = 0.2, p 10 = 0.2 and p n = 0.4. 

This is indeed close to a uniform distribution. The probability of both e\ and e 2 is 
0.6 and the variance is 0.24, which are again similar to the reference values for the 
maximum entropy case. On the other hand, for B 2 we have 

Poo = 0, poi = 0.12, pio = 0.28 and p u = 0.6. 

These probabilities are markedly different from a uniform distribution; the probabilities 
of e\ and e 2 are respectively 0.88 and 0.72. Considering also the correlation between 
e\ and e 2; it is intuitively clear why S 2 is not as close as Si to f~, ~J. This is also 
true for B3, which has the same marginal distributions as B 2 but with a much stronger 
correlation. 

3.2 Directed acyclic graphs 

The behaviour of the multivariate Trinomial distribution in the minimum and interme- 
diate entropy cases is similar to the one of the multivariate Bernoulli in many respects, 
but presents profound differences in the maximum entropy case. The reason for these 
differences is that the structure of a Bayesian network is assumed to be acyclic. There- 
fore, the state of each arc (i.e. whether is present in the DAG and its direction) is 
influenced by the state of all other possible arcs even in the maximum entropy case, 
when otherwise they would be independent. Furthermore, the acyclicity constraint can- 
not be written in closed form, making the derivation of exact results on the moments 
of the distribution of £ particularly difficult. 
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To obtain some simple expressions for the expected value and the covariance matrix, 
we will first prove a simple theorem on DAGs, which essentially states that if we reverse 
the direction of every arc the resulting graph is still a DAG. 

Theorem 3.1. Let G = (V, A) be a DAG, and let G* = (V, A*) another directed graph 
such that 

a~ij e A* <^^> tt[j e A and 
for every a>ij e A. Then G* is also acyclic. 

Proof. See Appendix 1. □ 

An immediate consequence of this theorem is that for every DAG including the arc 
Wij there exists another DAG including the arc Sy. Since all DAGs have the same 
probability in the maximum entropy case, this implies that both directions of every arc 
have the same probability 

for every possible a,j,i ^ j. (7) 
Then the expected value of each marginal Trinomial distribution is equal to 

E(Aij) = p~ij -fi-j = 

and its variance is equal to 

VAR(A i:) ) = p~ij +p~j - ii'.j )',, '- = 2pT r 

The joint probabilities associated with each pair of arcs also symmetric in the maxi- 
mum entropy case, again due to Theorem 3.1. Denote with a\j the event that arc ay is 
not present in the DAG. If we consider that both directions of every arc have the same 
probability and that there is no explicit ordering among the arcs, we have 

P(o^,om) = P(S7j,Sfci), P(a7j, &fci) = P(ajj, ojy), 

a~ki) = P(a~ij,a°ki) = P(ay,Sfe]) = P(ajj,a°ki). 

Then the expression for the covariance simplifies to 

COV(Ay ,A fci ) = 2 [P(a7], oftj) - P(o7), Sw)] , 

which can be interpreted as the difference in probability between a serial connection 
(i.e. Vi — > Vj — » vi, if j — k) and a converging connection (i.e. vi — > Vj <— vi) if the 
arcs are incident on a common node (Jensen and Nielsen 2007). This is interesting be- 
cause v-structures are invariant within equivalence classes, while other patterns of arcs 
are not (Chickering 1995); indeed, equivalence classes are usually represented as par- 
tially directed acyclic graphs (PDAGs) in which only arcs belonging to v-structures are 
directed. All other arcs, with the exclusion of those which could introduce additional v- 
structures (known as compelled arcs), are replaced with the corresponding (undirected) 



tin ~q £ j4 ^ > ' Qji £ -A 



Pij — Pij 
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3 



4 



5 



6 



7 



number of nodes 



Figure 2: Exact (dashed line) and approximate (solid line) probabilities of an arc being 
present in a DAG with 3, 4, 5, 6, and 7 nodes. The dotted line represents the limiting 
value in the number of nodes. 



edges. Therefore, the combination of high values of |COV(j4jj, A)-i)\ and ph is indica- 
tive of the belief that the corresponding arcs are directed in the PDAG identified by the 
equivalence class. Along with with VAR(Aij) and VAR(A/ C /), it is also indicative of the 
stability of the graph structure, both in the arcs and their directions. In an uninfor- 
mative prior, such as the distribution we are now considering in the maximum entropy 
case, we expect all covariances to be small; we will show this is the case in Theorem 
3.4. On the other hand, in an informative distribution such as the ones considered in 
the intermediate entropy case, we expect covariances to be closer to their upper bounds 
for arcs that are compelled or part of a converging connection, and closer to zero for 
arcs whose direction is not determined in the equivalence class. Note that the sign of 
CO\/(Aij, Af~i) depends on the way the two possible directions of each arc are associated 
with 1 and —1; a simple way to obtain a consistent paramctcrisation is to follow the 
natural ordering of the variables (i.e. if i ^ j then the arc incident on these nodes is 
taken to be A%j, a~ij is associated with 1 and &7J with —1). 

The equalities in Equations 7 and 8 drastically reduce the number of free parameters 
in the maximum entropy case. The marginal distribution of each arc now depends only 
on plj, whose value can be derived from the following numerical approximation by 
Melangon et al. (2000). 

Theorem 3.2. The average number of arcs in a DAG with n nodes is approximately 
jn 2 in the maximum entropy case. 



Proof. See Melangon et al. (2000). 



□ 
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Figure 3: Estimated (dashed line) and approximate (solid line) probabilities of an arc 
being present in a DAG with 8 to 50 nodes. The dotted line represents the limiting 
value in the number of nodes. 



Theorem 3.3. Let G = (V, A) be a DAG with n nodes. Then for each possible arc 
a-ijji j we have that in the maximum entropy case 



Pij ~ Vl] ~ 4 + 4(n - 1) 



and 



Pij 



2(n-l)' 



Proof. See Appendix 1. 



□ 



The quality of this approximation is examined in Figure 2 and Figure 3. In Figure 2, 
the values provided by Theorem 3.3 for DAGs with 3, 4, 5, 6 and 7 nodes are compared 
to the corresponding true values. The latter have been computed by enumerating all 
possible DAGs of that size (i.e. the whole population) and computing the relative 
frequency of each possible arc. In Figure 3, the values provided by Theorem 3.3 for DAGs 
with 8 to 50 nodes are compared with the corresponding estimated values computed 
over a set of 10 9 DAGs of the same size. The latter have been generated with uniform 
probability using the algorithm from Melangon and Fabrice (2004) as implemented in 
the bnlearn package (Scutari 2010, 2012) for R (R Development Core Team 2012). 

We can clearly see that the approximate values are close to the corresponding true 
(in Figure 2) or estimated (in Figure 3) values for DAGs with at least 6 nodes. This 
is not a significant limitation; the true values can be easily computed via exhaustive 
enumeration for DAGs with 3, 4 and 5 nodes (they are reported in Appendix 2, along 
with other relevant quantities). Furthermore, it is evident both from Theorem 3.3 and 
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from Figures 2 and 3 that, as the number of nodes diverges, 

lim = lim fe = \ and lim pL = ~. (9) 

If we take the absolute value of this asymptotic Trinomial distribution, the resulting 
random variable is Ber(pij) with p^ = i, which is the marginal distribution of an edge 
in an UG in the maximum entropy case. The absolute value transformation can be 
interpreted as ignoring the direction of the arc; the events e A and a~ij e A collapse 
into eij e E, while S^,a^ ^ A maps to E. As a result, the marginal distribution 
of an arc is remarkably similar to the one of the corresponding edge in an undirected 
graph for sufficiently large DAGs; in both cases, the nodes v t and Vj are linked with 
probability |. 

No result similar to Theorem 3.2 has been proved for arbitrary pairs of arcs in a 
directed acyclic graph; therefore, the structure of the covariance matrix E can be derived 
only in part. Variances can be approximated using the approximate probabilities from 
Theorem 3.3: 

VAR(A n ) = 2p~* ~ - + — r - as n -> oo. (10) 

v vj 2 2(n-l) 2 v ' 

Therefore, maximum variance (of each arc) and maximum entropy (of the graph 
structure) are distinct, as opposed to what happens in UGs. However, we can use the 
decomposition of the variance introduced in Equation 6 to motivate why the maximum 
entropy case is still a "worst case" outcome for P(Q(£) \ T>). As we can see from Figure 
4, the contributions of the presence of an arc (given by the transformation \ Aij\) and its 
direction (given by the 4pijpij = 4pij 2 term) to the variance are asymptotically equal. 
This is a consequence of the limits in Equation 9, which imply that an arc (modulo its 
direction) has the same probability to be present in or absent from the DAG and that 
its directions also have the same probability. As a result, we are not able to make any 
decision about either the presence of the arc or its direction. On the contrary, when 
VAR(Ay) reaches it maximum at 1 we have that P({a^j, Sij}) = 1 and P(a°ij) = 0, so we 
are sure that the arc will be present in the DAG in one of its two possible directions. 

As for the covariances, it is possible to obtain tight bounds using Hoeff 'ding's identity 
(Hoeffding 1940; Fisher and Sen 1994), 

COV(X,Y) = ^F x . Y {x,y) - F x (x)F Y (y)dxdy, (11) 

R 2 

and the decomposition of the joint distribution of dependent random variables provided 
by the Farlie-Morgenstern-Gumbel (FMG) family of distributions (Mari and Kotz 2001), 
which has the form 

F XtY (x,y) = F x (x)F Y (y)[l + e(l-F x (x))(l-F Y (y))], \e\ < 1. (12) 

In Equations 11 and 12, F XtY , F x and F Y denote the cumulative distribution functions 
of the joint and marginal distributions of X and Y, respectively. 
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0.0 0.2 0.4 0.6 0.8 1.0 

arc probability (modulo its direction) 



Figure 4: Decomposition of the asymptotic variance of an arc in the part that depends 
only on its presence (dashed line) and the part that depends only on its direction (solid 
line). The dots correspond to the respective values in the maximum entropy case. 




Figure 5: Bounds for the absolute value of the covariance and the correlation coefficient 
of two arcs in a DAG with 6 to 50 nodes. The dotted lines represent the respective 
limiting values. 



16 



Priors and Posteriors in Graphical Modelling 




20 



30 



number of nodes 



Figure 6: Approximate values for P(a^,Oki) (solid line) and P(a~ij,tiki) (dashed line) 
for DAGs with 8 to 50 nodes. The dotted line represents their asymptotic value. 



Theorem 3.4. Let G = (V, A) be a DAG, and let atj, i j and aj-i, k I be two 
possible arcs. Then in the maximum entropy case we have that 



and 



|COV(Ay,A w )| <4 



\COR{A ij ,A kl )\<2 
Proof. See Appendix 1. 



1 



4 4(n-l) 



4 4(n - 1) 



1 1 

+ 



4 4(n-l) 



1 1 

+ 



4 4(n-l) 



(13) 



(14) 



□ 



The bounds obtained from this theorem appear to be tight in the light of the true 
values for the covariance and correlation coefficients (computed again by enumerating 
all possible DAGs of size 3 to 7). Figure 5 shows the bounds for DAGs with 6 to 50 
nodes; for DAGs with 3, 4 and 5 nodes the approximation of pij the bounds are based 
on is loose, and the true values of covariance and correlation are known. Non-null 
covariances range from +0.08 (for DAGs with 3 nodes) to +0.08410 (for DAGs with 
7 nodes), while non-null correlation coefficients vary from +0.125 (for DAGs with 3 
nodes) to +0.1423 (for DAGs with 7 nodes). Both covariance and correlation appear to 
be strictly increasing in modulus as the number of nodes increases, and converge to the 
limiting values of the bounds (0.140625 and 0.28125, respectively) from below. 

Some other interesting properties are apparent from true values of the covariance 
coefficients reported in Appendix 2. They are reported below as conjectures because, 
while they describe a systematic behaviour that emerges from the DAGs whose sizes we 
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have a complete enumeration for, we were not able to substantiate them with formal 
proofs. 

Conjecture 3.1. Arcs that are not incident on a common node are uncorrelated. 



This is a consequence of the fact that if we consider Aij and A^i with i ¥= j ¥= k ^ I, 
we have P(a~^,a~ki) = P(a~ij,tiki). Therefore COV (A^ , Aki) = 0. This property seems to 
generalise to DAGs with more than 7 nodes. Figure 6 shows approximate estimates for 
P(&y ; Ofc;) arL d P(a~ij, Sfe/) for DAGs with 8 to 50 nodes, obtained again from 10 9 DAGs 
generated with uniform probability. The curves for the two probabilities are overlapping 
and very close to each other for all the considered DAG sizes, thus supporting Conjecture 
3.1. 

Conjecture 3.2. The covariance matrix £ is sparse. 



The proportion of arcs incident on a common node converges to zero as the number 
of nodes increases; therefore, if we assume Conjecture 3.1 is true, the proportion of 
elements of £ that are equal to has limit 

(?)( n ? 2 ) (n-2)(n-3) , s 

1 > lim /n \ 2 n \ 2 L > lim ^ A ' = 1. 15 

Furthermore, even arcs that are incident on a common node are not strongly correlated. 

Conjecture 3.3. Both covariance and correlation between two arcs incident on a com- 
mon node are monotonically increasing in modulus. 

Conjecture 3.4. The covariance between two arcs incident on a common node takes 
values in the interval [0.08, 0.140625] in modulus, while the correlation takes values in 
[0.125,0.28125] m modulus. 

These intervals can be further reduced to [0.08410,0.140625] and [0.1423, 0.28125] 
for DAGs larger than 7 nodes due to Conjecture 3.3. 

As far as the other two cases are concerned, in the minimum entropy case we have 
that 



— 1 if txij e A 

if a~Tj , Oij $ A and S = O 

1 if Oij e A 



as in the minimum entropy case of UGs. The intermediate entropy case again ranges 
from being very close to the minimum entropy case (when the graph structure displays 
little variability) to being very close to the maximum entropy case (when the graph 
structure displays substantial variability) . The bounds on the eigenvalues of S derived 
in Lemma 2.2 allow a graphical representation of the variability of the network structure, 
equivalent to the one illustrated in Example 3.1 for UGs. 
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4 Measures of variability 

Several functions have been proposed in literature as univariate measures of spread of 
a multivariate distribution, usually under the assumption of multivariate normality; for 
some examples see Mardia et al. (1979) and Bilodeau and Brenner (1999). Three of 
them in particular can be used as descriptive statistics for the multivariate Bernoulli 
and Trinomial distributions: the generalised variance, 

VAR G (E) = det(E); 

the total variance, 

VAR T (E) = tr(S); 

and the squared Frobenius matrix norm of the difference between E and a target matrix 

VAR jF (E,*) = |||£-*|||f,. 



Both generalised variance and total variance associate high values of the statistic to 
unstable network structures, and are bounded due to the properties of the multivari- 
ate Bernoulli and Trinomial distributions. For total variance, it is easy to show that 
cither VAR T (E) e [0, f] (for the multivariate Bernoulli) or VAR T (E) e [0, k] (for the 
multivariate Trinomial), due to the bounds on the variances an and on the eigenval- 
ues Aj derived in Sections 2.1 and 2.2. Generalised variance is similarly bounded due to 
Hadamard's theorem on the determinant of a non- negative definite matrix (Sober 2008): 
VAR G (E) e [0, (l) k ] for the multivariate Bernoulli distribution and VAR G (E) e [0, 1] for 
the multivariate Trinomial. They reach the respective maxima in the maximum entropy 
case and are equal to zero only in the minimum entropy case. Generalised variance is 
also strictly convex, but it is equal to zero when E is rank deficient. For this reason 
it may be convenient to reduce E to a smaller, full rank matrix (say S*) and consider 
VAR G (E*) instead of VAR G (£); using a regularised estimator for E such as the one 
presented in Ledoit and Wolf (2003) is also a viable option. 

The behaviour of the squared Frobenius matrix norm, on the other hand, depends 
on the choice of the target matrix For "J = O (the covariance matrix arising from the 
minimum entropy case for both the multivariate Bernoulli and the multivariate Trino- 
mial), VARi?(E, *i>) associates high values of the statistic to unstable network structures, 
like VARt(E) and VAR G (E); however, VARi?(E,0) does not have a unique maximum 
and none of its maxima corresponds to the maximum entropy case, making its inter- 
pretation unclear. A better choice seems to be a multiple of the covariance matrix 
arising from the maximum entropy case, say ^ = kT, max , associating high values of 
VARj?(E, kYi max ) to stable network structures. For the multivariate Bernoulli, if we let 
* = jl k , VAR F (E, jl k ) can be rewritten as 
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It has both a unique global minimum (because it is a convex function), 
minVAR F [£, k I k ) = VAR F f ~I f 



'4^7 " * V4 k ~ U 4 J ~ 16 



and a unique global maximum, 

maxVAR F (e, £l fc ) = VAR F (0) = £ (J)' = 

which correspond to the maximum and minimum entropy covariance matrices, respec- 
tively. Similar results can be derived for the multivariate Trinomial distribution, using 
an approximate estimate for T> max based on the results presented in Section 3.2. 

All the descriptive statistics introduced in this section can be normalised as follows: 



VAR T (£) ™ ~ VAR G (E) 



VAR T (E) = / v ' VAR G (E) 



VAR F (E,fcE ma:c ) 



max s VAR T (E)' uy ' max E VAR G (E)' 

max s VAR F (£, kT, max ) - VAR F (E, kY, max ) 
max s VAR F (E, kY, max ) - min s VAR F (S, kT, max ) ' 



(16) 



These normalised statistics vary in the [0, 1] interval and associate high values to graphs 
whose structures display a high variability. Since they vary on a known and bounded 
scale, they are easy to interpret as absolute quantities (i.e. goodness-of-fit statistics) as 
well as relative ones (i.e. proportions of total possible variability). 

They also have a clear geometric interpretation as distances in £, as they can all be 
rewritten as function of the eigenvalues Ai, . . . , This allows, in turn, to provide an 
easy interpretation of otherwise complex properties of P(G(£)) and P(G(£) \T>) and to 
derive new results. First of all, the measures introduced in Equation 16 can be used to 
select the best learning algorithm A in terms of structure stability for a given data set V. 
Different algorithms make use of the information present in the data in different ways, 
under different sets of assumptions and with varying degrees of robustness. Therefore, 
in practice different algorithms learn different structures from the same data and, in 
turn, result in different posterior distributions on G. If we rewrite Equation 1 to make 
this dependence explicit, 

P(G(£) | V, A)cc P(G(£))P(V | G(£),A), 

and denote with the covariance matrix of the distribution of the edges (or the arcs) 
induced by P(G(£ ) \ T>,A), then we can choose the optimal structure learning algorithm 
A* as 

A* = argmin VAR T (IU) 
A 



or, equivalently, using VAR G (E^) or \/ARp(^A, kY, max ) instead of VARt(E^i). Such an 
algorithm has the desirable property of maximising the information gain from the data, 
as measured by the distance from the non-informative prior P(G(£)) in C. In other 



20 



Priors and Posteriors in Graphical Modelling 



words, A* is the algorithm that uses the data in the most efficient way. Furthermore, 
an optimal A* can be identified even for data sets without a "golden standard" graph 
structure to use for comparison; this is not possible with the approaches commonly 
used in literature, which rely on variations of Hamming distance (Jungnickcl 2008) and 
knowledge of such a "golden standard" to evaluate learning algorithms (see, for example 
Tsamardinos et al. 2006). 

Similarly, it is possible to study the influence of different values of a tuning parameter 
for a given structure learning algorithm (and again a given data set). Such parameters 
include, for example, restrictions on the degrees of the nodes (Friedman ct al. 1999b) 
and regularisation coefficients (Koller and Friedman 2009). If we denote these tuning 
parameters with r, we can again choose an optimal r* as 

r* = argmin VARt(S_^( t )). 

T 

Another natural application of the variability measures presented in Equation 16 is 
the study of the consistency of structure learning algorithms. It has been proved in 
literature that most of structure learning algorithms are increasingly able to identify a 
single, minimal graph structure as the sample size diverges (see, for example Chickcring 
2002). Therefore, P(Q(£)\T>) converges towards the minimum entropy case and all 
variability measures converge to zero. However, convergence speed has never been 
analysed and compared across different learning algorithms; any one of VAR<r(5]_4), 
VARg(S]_4) or VARj?(£_4, kY, max ) provides a coherent way to perform such an analysis. 

Lastly, we may use the variability measures from Equation 16 as basis to investigate 
different prior distributions for real-world data modelling and to define new ones. Rel- 
atively little attention has been paid in literature to the choice of the prior over G, and 
the uniform maximum entropy distribution is usually chosen for computational reasons. 
Its only parameter is the imaginary sample size, which expresses the weight assigned to 
the prior distribution as the size of an imaginary sample size supporting it (Heckerman 
et al. 1995). 

However, choosing a uniform prior also has some drawbacks. Firstly, Stock and 
Jaakkola (2002) and Stock (2008) have shown that both large and small values of the 
imaginary sample size have unintuitive effects on the sparsity of a Bayesian network 
even for large sample sizes. For instance, large values of the imaginary sample size may 
favour the presence of an arc over its absence even when both Y{Q{£)) and T> imply 
the variables the arc is incident on are conditionally independent. Secondly, a uniform 
prior assigns a non-null probability to all possible models. Therefore, it often results 
in a very flat posterior which is not able discriminate between networks that are well 
supported by the data and networks that are not (Koller and Friedman 2009). 

Following Pearl (1988) 's suggestion that "good" graphical models should be sparse, 
sparsity-inducing priors such as the ones in Buntine (1991) and Friedman and Koller 
(2003) should be preferred to the maximum entropy distribution, as should informative 
priors (Mukherjee and Speed 2008). For example, the prior proposed in Buntine (1991) 
introduces a prior probability /3 to include (independently) each arc in a Bayesian net- 
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work with a given topological ordering, which means = f3 and pij = for all i < j in 
P(G(£)). Thus, VAR(Ay) = - /3 2 , VAR T (E) = - /? 2 ) and VAR G (E) = (/3 - /? 2 ) fe . 
The prior proposed in Friedman ct al. (1999a), on the other hand, controls the num- 
ber of parents of each node for a given topological ordering. Therefore, it favours low 
values of P(a^,Sjfc) in P(Q(£)) and again pij = for all i < j. Clearly, the amount 
of sparsity induced by the hyperparameters of these priors determines the variability of 
both the prior and the posterior, and can be controlled through the variability measures 
from Equation 16. Furthermore, these measures can provide inspiration in devising new 
priors with the desired form and amount of sparsity. 

5 Conclusions 

Bayesian inference on the structure of graphical models is challenging in most situations 
due to the difficulties in defining and analysing prior and posterior distributions over 
the spaces of undirected or directed acyclic graphs. The dimension of these spaces grows 
supcr-exponcntially in the number of variables considered in the model, making even 
MAP analyses problematic. 

In this paper, we propose an alternative approach to the analysis of graph structures 
which focuses on the set of possible edges £ of a graphical model M. = (Q(£),Q) instead 
of the possible graph structures themselves. The latter are uniquely identified by the 
respective edge sets; therefore, the proposed approach integrates smoothly with and 
extends both frcqucntist and Bayesian results present in literature. Furthermore, this 
change in focus provides additional insights on the behaviour of individual edges (which 
are usually the focus of inference) and reduces the dimension of the sample space from 
super-exponential to quadratic in the number of variables. 

For many inference problems the parameter space is reduced as well, and makes 
complex inferential tasks feasible. As an example, we characterise several measures of 
structural variability for both Bayesian and Markov networks using the second order 
moments of P(Q(£)) and P(Q(£) | T>). These measures have several possible applications 
and are easy to interpret from both an algebraic and a geometric point of view. 
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1 Proofs 

Proof of Lemma 2.1. Since E is a real, symmetric, non-negative definite matrix, its 
eigenvalues Xi are non-negative real numbers; this proves the lower bound in both 
inequalities. 
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The upper bound in the first inequality holds because 

2 Ai = 2 cr« < max ^ cr ij; = ^ maxer^ = -, 
as the sum of the eigenvalues is equal to the trace of E. This in turn implies 

i—l 

which completes the proof. □ 

Proof of Theorem 2.2. It is easy to show that each \Ti\ = Bi, with p^u + p,(_i) = p* 
and Pi{o) = 1 — P*- It follows that the parameter collection p of T reduces to 

P* = (p/(T) e {1, ...,£}, T e {0, 1} 1 ", 
= {p / :/E{l,... > *},/#0} 

after the transformation. Therefore, |T| ~ Berfc(p*) is a uniquely identified multivariate 
Bernoulli random variable according to the definition introduced at the beginning of 
Section 2.1. □ 

Proof of Theorem 3.1. Let's assume by contradiction that G* is cyclic; this implies that 
there are one or more nodes ^eV such that 

«ij Oki 

Vi > Vj ■ -> . . . -* v k * Vi 

for some Vj, v k G V. However, this would mean that in G we would have 

Vi * Vk -* . . • -» Vj * Vi 

which is not possible since G is assumed to be acyclic. □ 

Proof of Theorem 3. 3. Each possible arc can appear in the graph in only one direction 
at a time, so a directed acyclic graph with n nodes can have at most (™) = \n{n — 1) 
arcs. Therefore 

PH+Wj - i n{n _ l) - 2 + 2(„-l)' 
But in the maximum entropy case we also have that ptj = pi], so 

^ = ^4 + 4(^1) and ^ = l - 2 ^^\-^hiy 

which completes the proof. □ 
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Proof of Theorem 3.4- In the maximum entropy case, all arcs have the same marginal 
distribution function, 



F A (a,ij) < 



f 
1 1 

4 + 4(n - 1) 
3 1 



in (—oo,—l] 
in (-1,0] 

in (0, 1] 



(17) 



4 4(n-l) 
1 in (1, +oo ) 



so the joint distribution of any pair of arcs a^- and ciki can be written as a member of 
the Farlie-Morgenstern-Gumbel family of distribution as 

^A.K.aw) = F A (a. l3 )F A (a kl )[l + e(l - ^K))(l - F A (a kl ))]. (18) 

Then if we apply Hoeffding's identity from Equation 11 and replace the joint distri- 
bution function F AijAkl (ai 3 ,a k i) with the right hand of Equation 18 we have that 

\Z0V{A l3 ,A kl )\ = 



Y Y F A tJ Ak,(^j, a ki) - F A { aij )F A (a k i) 

{-1,0,1} {-1,0,1} 

Y Y \ F A ijt Aki( a ij' a kl) ~ F A (Oij)F A (a k l)\ 
{-1,0,1} {-1,0,1} 

Y Y \F A (a lJ )F A (a kl )[l+ 

{-1,0,1} {-1,0,1} 

+e(l - F A { aij ))(l - F A (a kl ))] - F A (a tJ ) F A (a u ) 
Y Y a-F A ( aij ))(l-F A (a kl )). 

{-1,0} {-1,0} 



<: 



We can now compute the bounds for |C0V(ay, a k i)\ and |COR(ajj, aki)\ using only 
the marginal distribution function F A from Equation 17 and the variance from Equation 
10, thus obtaining the expressions in Equation 13 and Equation 14. □ 



2 Moments and parameters of the multivariate Trinomial 
distribution in the maximum entropy case 

Below are reported the exact values of the parameters of the marginal Trinomial dis- 
tributions and of the first and second order moments of the multivariate Trinomial 
distribution in the maximum entropy case. All these quantities have been computed by 
a complete enumeration of the directed acyclic graphs of a given size (3, 4, 5, 6 and 7). 
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Moments for the 3-dimensional distribution 



Aij 



-1 with probability 0.32 E(A 4J ) = 

with probability 0.36 VAR(A ij ) = 0.64 

1 with probability 0.32 |C0V(^-, A u )\ = 0.08 



Moments for the 4-dimensional distribution 

1 with probability 0.309392 

Ei„ = 

A i:j = \ with probability 0.381215 
1 with probability 0.309392 



VAR(A ij ) = 0.618784 



|COV(i4y,A fcI )| 



if i ^ j ^ k ± I 

0.081031 otherwise 



Moments for the 5-dimensional distribution 

1 with probability 0.301082 
Aij ■. = \ with probability 0.397834 
1 with probability 0.301082 



VAR(Aij) = 0.602165 



|COV(i4y,A fcI )| 



if i ^ j ^ k ± I 

0.081691 otherwise 



Moments for the 6-dimensional distribution 

- 1 with probability 0.294562 

Au = { with probability 0.410875 ^ , ~ 

' ' VAR(A„) = 0.589124 

1 with probability 0.294562 J/ 



|COV(i4y,A fcI )| = 



if i ^ j * k ^ I 

0.082121 otherwise 



Moments for the 7-dimensional distribution 

- 1 with probability 0.289390 

Au = { with probability 0.421220 ^ ,~ 

' VARA„ = 0.578780 

1 with probability 0.289390 J 
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|C0V(Ai,4w) 



if i ^ j ^ k I 

0.82410 otherwise 
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